receptive field
BeyondMix: Leveraging Structural Priors and Long-Range Dependencies for Domain-Invariant LiDARSegmentation
Domain adaptation for LiDAR semantic segmentation remains challenging due to the complex structural properties of point cloud data. While mix-based paradigms have shown promise, they often fail to fully leverage the rich structural priors inherent in 3DLiDAR point clouds. In this paper, we identify three critical yet underexploited structural priors: permutation invariance, local consistency, and geometric consistency. We introduce BeyondMix, a novel framework that harnesses the capabilities of State Space Models (specifically Mamba) to construct and exploit these structural priors while modeling long-range dependencies that transcend the limited receptive fields of conventional voxel-based approaches. By employing space-filling curves to impose sequential ordering on point cloud data and implementing strategic spatial partitioning schemes, BeyondMix effectively captures domain-invariant representations. Extensive experiments on challenging LiDAR semantic segmentation benchmarks demonstrate that our approach consistently outperforms existing state-of-the-art methods, establishing a new paradigm for unsupervised domain adaptation in 3D point cloud understanding.
Unveiling the Spatial-temporal Effective Receptive Fields of Spiking Neural Networks
Spiking Neural Networks (SNNs) demonstrate significant potential for energyefficient neuromorphic computing through an event-driven paradigm. While training methods and computational models have greatly advanced, SNNs struggle to achieve competitive performance in visual long-sequence modeling tasks. In artificial neural networks, the effective receptive field (ERF) serves as a valuable tool for analyzing feature extraction capabilities in visual long-sequence modeling. Inspired by this, we introduce the Spatio-Temporal Effective Receptive Field (ST-ERF) to analyze the ERF distributions across various Transformer-based SNNs. Based on the proposed ST-ERF, we reveal that these models suffer from establishing a robust global ST-ERF, thereby limiting their visual feature modeling capabilities. To overcome this issue, we propose two novel channel-mixer architectures: multilayer-perceptron-based mixer (MLPixer) and splash-and-reconstruct block (SRB). These architectures enhance global spatial ERF through all timesteps in early network stages of Transformer-based SNNs, improving performance on challenging visual long-sequence modeling tasks. Extensive experiments conducted on the Meta-SDT variants and across object detection and semantic segmentation tasks further validate the effectiveness of our proposed method. Beyond these specific applications, we believe the proposed ST-ERF framework can provide valuable insights for designing and optimizing SNN architectures across a broader range of tasks.
Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry
Sparse Autoencoders (SAEs) are widely used to interpret neural networks by identifying meaningful concepts from their representations. However, do SAEs truly uncover all concepts a model relies on, or are they inherently biased toward certain kinds of concepts? We introduce a unified framework that recasts SAEs as solutions to a bilevel optimization problem, revealing a fundamental challenge: each SAE imposes structural assumptions about how concepts are encoded in model representations, which in turn shapes what it can and cannot detect. This means different SAEs are not interchangeable--switching architectures can expose entirely new concepts or obscure existing ones. To systematically probe this effect, we evaluate SAEs across a spectrum of settings: from controlled toy models that isolate key variables, to semi-synthetic experiments on real model activations and finally to large-scale, naturalistic datasets. Across this progression, we examine two fundamental properties that real-world concepts often exhibit: heterogeneity in intrinsic dimensionality (some concepts are inherently low-dimensional, others are not) and nonlinear separability. We show that SAEs fail to recover concepts when these properties are ignored, and we design a new SAE that explicitly incorporates both, enabling the discovery of previously hidden concepts and reinforcing our theoretical insights. Our findings challenge the idea of a universal SAE and underscores the need for architecture-specific choices in model interpretability.
Focal Modulation Networks
We propose focal modulation networks (FocalNets in short), where self-attention (SA) is completely replaced by a focal modulation module for modeling token interactions in vision. Focal modulation comprises three components: (i)hierarchical contextualization, implemented using a stack of depth-wise convolutional layers, to encode visual contexts from short to long ranges, (ii) gated aggregation to selectively gather contexts for each query token based on its content, and (iii) element-wise modulation or affine transformation to fuse the aggregated context into the query. Extensive experiments show FocalNets outperform the state-of-the-art SA counterparts (e.g., Swin and Focal Transformers) with similar computational cost on the tasks of image classification, object detection, and semantic segmentation. Specifically, FocalNets with tiny and base size achieve 82.3% and 83.9% top-1 accuracy on ImageNet-1K.
MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning
Tiny deep learning on microcontroller units (MCUs) is challenging due to the limited memory size. We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs: the first several blocks have an order of magnitude larger memory usage than the rest of the network. To alleviate this issue, we propose a generic patch-by-patch inference scheduling, which operates only on a small spatial region of the feature map and significantly cuts down the peak memory. However, naive implementation brings overlapping patches and computation overhead. We further propose receptive field redistribution to shift the receptive field and FLOPs to the later stage and reduce the computation overhead. Manually redistributing the receptive field is difficult.
Accuracy [% ] Elastic Transform 1 2 3 4 5 0 20
Here we compute the mean and standard deviation across seeds. Model Robustness score Baseline 100% MTL with real responses 109% MTL with predicted responses (MTL-Monkey) 118% MTL with shuffled predicted responses (MTL-Shuffled) 98% Table 3: Comparing our MTL model co-trained on predicted neural responses -MTL-Monkey in the paper-to the MTL model co-trained directly on real monkey V1 responses. We computed the robustness score of each model after averaging the accuracies of 3 seeds per model for each corruption type in TIN-TC and normalizing against the baseline test accuracies, i.e. the baseline score is 100%. We find that we can obtain a general increase in robustness when using real neural data. However, co-training on predicted neural responses improves the robustness of the models even more.